(54) Title: STATISTICAL THESAURUS, METHOD OF FORMING SAME, AND USE THEREOF IN QUERY EXPANSION IN
AUTOMATED TEXT SEARCHING

(57) Abstract 

A statistical thesaurus is built dynamically, from the same text collection that is
being searched, allowing improved generation of expanded query terms. The thesaurus
is dynamic in that thesaurus records are collected, ranked, accessed, and applied
dynamically. Thesaurus "records" are actually formed as indexed documents arranged
in "collections". The collections are preferably distinguished based on text source.
Each record has terms assembled in indexed groups which inherently reflect a ranking
based on relevance to an initial query. After an initial query is received, the appropriate
collection(s) of records may be searched by a conventional search and retrieval engine,
the searches inherently returning records ranked by degree of relevance due to the record
indexing scheme. A record ranking scheme avoids contamination of relevant records
by less relevant records. The record selection and the expansion query term generation
processes are each divided into parallel threads. The separate threads correspond to
respective text sources to enable the improved expansion query term generation to be
provided in real time.








WO 97/34242 PCT/US97/03185



STATISTICAL THESAURUS, METHOD OF FORMING SAME, 
AND USE THEREOF IN QUERY EXPANSION 
IN AUTOMATED TEXT SEARCHING 

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to the field of automated search and retrieval
of text documents. More specifically, the invention relates to thesauri (especially statistical
thesauri), to the structures of the statistical thesauri, to methods of forming the statistical
thesauri, and to use of the statistical thesauri in query expansion.

2. Related Art

It is known in the field of information retrieval that both precision and recall can be
greatly improved when queries are expanded to contain a larger number of good search
terms. A thesaurus can be used to increase the number of good search terms.

A statistical thesaurus is a thesaurus which contains terms that are related to the
headword by their co-occurrence with the headword in text. This is in contrast to a
traditional thesaurus whose terms, synonyms, are related to the headword by meaning.

Recent research has shown that a statistical thesaurus provides good search terms
when used for query expansion, while traditional thesauri provide little improvement and
may actually hurt overall performance. As an example, Figure 6 illustrates synonyms
for the headword "murder" from a traditional thesaurus, while Figure 7 illustrates the
related concepts from a statistical thesaurus.

Statistical thesauri can also provide related concepts for many terms not found in a
traditional thesaurus, including current events. For example, Figure 8 illustrates the related
concepts for the term "Whitewater". This meaning of the term "Whitewater" cannot be
found in any traditional thesaurus.

Therefore, a high performance statistical thesaurus is a very useful tool in an
information retrieval system. It is to improving the formation, structure and use of
statistical thesauri that the present invention is directed.

SUMMARY OF THE INVENTION

The inventive statistical thesaurus provides a high degree of performance, is
scalable to multiple users and large amounts of source information, and is tunable to
specific source information. The thesaurus works best when it is built from the text
collection being searched. In order to meet these requirements, the thesaurus is built
dynamically, from the same text collection that is
being searched, allowing improved generation of expanded query terms. The thesaurus is
dynamic in that thesaurus records are collected, ranked, accessed, and applied dynamically.
Thesaurus "records" are actually formed as indexed documents arranged in "collections".
The collections are preferably distinguished based on text source (court cases versus news
wires versus patents, and so forth). Each record has terms assembled in indexed groups, or
segments, which inherently reflect a ranking based on relevance to an initial query. After
an initial query is received, the appropriate collection(s) of records may be searched by a
conventional search and retrieval engine, the searches inherently returning records ranked
by degree of relevance due to the record indexing scheme. A record ranking scheme
avoids contamination of relevant records by less relevant records. The record selection and
the expansion query term generation processes are each divided into parallel threads. The
separate threads correspond to respective text sources to enable the improved expansion
query term generation to be provided in real time.

More specifically, the invention provides a dynamic statistical thesaurus including a
collection of records which contain weighted term relationships. The statistical thesaurus is
divided into multiple indexed collections based on sampled source material, and is searched
interactively to construct a list of related concepts for one or more expanded query terms.

The invention also provides a ranking method for collecting and ranking term
relationship records, and then deriving a final set of related concepts. The ranking method
uses descending term weights for each query term, and sums the weights to determine the
score. Furthermore, the method ranks up to (for example) 100 records, but will use as few
as (for example) 25 records when a large change in score occurs, or as few as (for
example) 50 records when any change in score occurs.

Moreover, the invention provides a statistical thesaurus structure which stores the
collection of logical term relationship records as a document in which the terms are
grouped by term weight in different indexed sections (segments) of the document. This
allows a conventional document retrieval system to build the index, and create the
candidate set of records for ranking.

The invention further provides a method of parallel processing which involves
dividing the statistical thesaurus into small physical collections, each with its own index,
searching the multiple collections simultaneously, and merging the search results.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is better understood by reading the following Detailed Description of
the Preferred Embodiments with reference to the accompanying drawing figures, in which
like reference numerals refer to like elements throughout, and in which:

Figure 1 illustrates a query expansion process involving use of a statistical 
thesaurus. 

Figure 2 illustrates a preferred process for forming a statistical thesaurus. 

Figure 3 illustrates a preferred record selection method. 

Figure 4 illustrates a preferred term extraction process of the retrieval method.

Figure 5 illustrates various parallel phases in a preferred query expansion method 
according to the present invention. 

Figure 6 illustrates synonyms for the headword "murder" from a traditional 
(meaning-based) thesaurus. 

Figure 7 illustrates related concepts from a statistical (co-occurrence-based) 
thesaurus using news-based material. 

Figure 8 illustrates related concepts for the term "Whitewater" resulting from use of
a statistical thesaurus.

Figure 9 shows related concepts from a statistical thesaurus using GENFED (legal 
material) searches. 

Figure 10A illustrates an exemplary indexing scheme in a dictionary for a given
collection, showing entries including a term in association with references to a document
and a set of "groups" which reflect ranking of terms based on relevance.

Figure 10B illustrates an exemplary "Top 100" List of ranked records, showing a
"collection" number (based on text source), a document number, and a score (based on
rankings determined by group within a record).

Figure 11A illustrates an exemplary work file record which is used in processing
described with reference to FIGS. 4 and 5.

Figure 11B illustrates a "Top 26" Terms Table showing terms ordered by frequency
of occurrence.

Figure 12 shows a Collection Map illustrating a list of possible text sources (court
cases, news wires, patents) in conjunction with respective numbers of collections present
and lists of those collections.

Figure 13 shows a process of selecting collections on which to base subsequent
searches.

Figure 14 illustrates an exemplary hardware platform on which the inventions
related to the formation, storage, and use of the statistical thesaurus may be implemented.




DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In describing preferred embodiments of the present invention illustrated in the 
drawings, specific terminology is employed for the sake of clarity. However, the invention 
is not intended to be limited to the specific terminology so selected, and it is to be 
understood that each specific element includes all technical equivalents which operate in a 
similar manner to accomplish a similar purpose. 

A static thesaurus is built once, and accessed many times. For example, a standard 
thesaurus is published in the form of a book, with a fixed set of synonyms for each 
headword. All of the work is done before the text is published. 

For a statistical thesaurus, the related terms for a headword vary depending on the 
source text collection being searched, and over time as new material is added to the 
collection. Rebuilding a static list of related terms and headwords would be very 
computation-intensive and time consuming, limiting the ability to tune the thesaurus by 
source text collection and keep it current. 

As examples of a collection-specific statistical thesaurus, reference is made to
Figure 7, which is based on news material, and Figure 9, which is based on legal material.

Query expansion takes place in real time, while the user waits at a terminal. 
Response time must therefore be very short and consistent. The present invention provides 
consistently short response time by processing an expansion request using parallel 
operations distinguished by the source text collection. 

Figure 1 illustrates a query expansion process. The process begins when the end
user of the system enters a search query including one or more terms. The user may choose
to expand the query as a whole, for statistical retrieval, or by specific term or terms, for
boolean retrieval. If the user specifies query expansion, the list of terms to be used for
expansion is constructed, and the statistical thesaurus is accessed.

Unlike a traditional thesaurus, all requested terms are processed together to provide 
a single set of related concepts. The related concepts are then displayed to the user who 
can select which, if any, of the concepts to include within the query. The query can then be 
expanded again, or run by the user. 

The statistical thesaurus' structure and content, as well as its method of construction,
are first described. Then, a method of retrieval of related concepts using the thesaurus will
be described.

The statistical thesaurus is a set of records, with each record containing a set of
[illegible] through group 5 terms are still meaningful concepts within the body of
[illegible]

[illegible] so that the appropriate set of records can be
selected by the user. For example, a first set of records may be formed based on federal
case law, and a second set of records may be formed based on news wires. When
the user later searches case law, the first set of records is used for the statistical thesaurus.
When the user is searching news material, the second set is used.

Figure 2 illustrates a preferred process for forming a statistical thesaurus.
First, source documents are read, the valuable terms and phrases from the
documents are extracted, and thesaurus "records" are written. The thesaurus records are
essentially documents having a set of (for example) five groups (or document segments),
each group inherently reflecting a ranking of the terms in the group.

The following is an example of a thesaurus "record" in a preferred embodiment, in
which "murder" was the original query term:

GROUP1:

@ felony murder rule @ @ felony murder @
GROUP2:

@ murder @ @ malice @ @ harmless @
GROUP3:

@ element of malice @ @ superfluous @ @ convicted of murder @ @ malice
instruction @ @ degree murder @ @ habeas @

This simple example of a record contains terms only in the first three groups, due to the
small size of the source document (the opinion in Davis v. State of Tennessee and Larry
Lack, 856 F.2d 35; 1988 U.S. App. LEXIS 11941 (CA 6, 1988)). The "@" signs signal
the beginning and end of terms to clearly delimit phrases. Of course, variations on this
format lie within the contemplation of the present invention.
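
By way of illustration only, a record in the format shown above could be read back into its term groups roughly as follows. This is a minimal sketch: the names parse_record, GROUP_HEADER, TERM and RECORD are assumptions introduced here, and the specification does not prescribe any particular parsing routine.

```python
import re

# Hypothetical helpers; the specification only defines the textual record format.
GROUP_HEADER = re.compile(r"^GROUP(\d+):$")
TERM = re.compile(r"@\s*(.*?)\s*@")  # one "@ term @" pair

def parse_record(text):
    """Return {group_number: [terms]} for one thesaurus record."""
    lines_by_group = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = GROUP_HEADER.match(line)
        if header:
            current = int(header.group(1))
            lines_by_group[current] = []
        elif line and current is not None:
            lines_by_group[current].append(line)
    # Join wrapped lines first so a phrase split across lines stays whole.
    return {g: TERM.findall(" ".join(ls)) for g, ls in lines_by_group.items()}

RECORD = """GROUP1:
@ felony murder rule @ @ felony murder @
GROUP2:
@ murder @ @ malice @ @ harmless @
GROUP3:
@ element of malice @ @ superfluous @ @ convicted of murder @ @ malice
instruction @ @ degree murder @ @ habeas @
"""

groups = parse_record(RECORD)
```

Joining the lines of a group before matching is a design choice made for this sketch so that a phrase wrapped across two lines, such as "malice instruction", is recovered as a single term.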

The thesaurus records are then processed to build a statistical thesaurus index and to
build compressed records which are optimized for use in later retrieval operations. Figure
10A illustrates an exemplary indexing scheme in a dictionary for a given collection,
showing entries including a term in association with references to a document and a set of
"groups" which reflect ranking of terms based on relevance.

The index is a typical inverted text index, as known to those skilled in the art and as 
described by Salton in his text, Automatic Text Processing. Each term that appears in any 
record appears in the index with a list of records in which it appears. 

Furthermore, the index also specifies which term group the term is in. Each record
can be thought of as a document. Each group can be thought of as a sub-portion (or
segment) of the document such as a paragraph. The records are grouped by their source
collection type (legal or news), and exist in many different physical collections. A physical
collection has its own index file and compressed text file.
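
As a hedged illustration of such an index, the following sketch builds an inverted index in which each posting carries both the record number and the group number. The function name build_index and the in-memory dictionary layout are assumptions made for this example; the actual index file format is not specified here.

```python
from collections import defaultdict

def build_index(records):
    """Build a term -> [(record_id, group)] inverted index.

    `records` maps a record id to {group_number: [terms]}; this layout is
    an assumption of the sketch, not the on-disk format.
    """
    index = defaultdict(list)
    for record_id in sorted(records):
        for group in sorted(records[record_id]):
            for term in records[record_id][group]:
                index[term].append((record_id, group))
    return dict(index)

# Two tiny hypothetical records from one physical collection.
example_records = {
    1: {1: ["felony murder"], 2: ["murder", "malice"]},
    2: {1: ["murder"], 3: ["habeas"]},
}
example_index = build_index(example_records)
```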

FIG. 5 is a high-level illustration of the relationship of the processes of FIG. 13, 
FIG. 3, and FIG. 4. FIG. 13, FIG. 3, and FIG. 4 are now described. 

Figure 13 illustrates a preliminary process of selecting a collection which is to be
searched in determining suitable query expansion terms. This process allows later
processes to focus on a most appropriate source of phrases (case law, news wires, and so
forth). First, the user-selected source is determined, and the source is looked up in a
source map. A list of text source collections to be searched is then output. At this point,
the processing illustrated in FIGS. 3 and 4 can take place.

Figure 12 shows a Collection Map illustrating a list of possible text sources (court
cases, news wires, patents) in conjunction with respective numbers of collections present
and lists of those collections. The Collection Map is referenced to select a suitable
collection.
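
A minimal sketch of this lookup, assuming the Collection Map is held as a simple in-memory mapping, might read as follows. The collection numbers and the names COLLECTION_MAP and select_collections are hypothetical and chosen only for illustration.

```python
# Hypothetical Collection Map: text source -> list of physical collection numbers.
COLLECTION_MAP = {
    "court cases": [1, 2, 3],
    "news wires": [4, 5],
    "patents": [6],
}

def select_collections(source, collection_map=COLLECTION_MAP):
    """Return the collections to search for the user-selected source."""
    # A copy is returned so callers cannot alter the map by accident.
    return list(collection_map.get(source, []))

def collection_count(source, collection_map=COLLECTION_MAP):
    """Return the number of collections present for a source."""
    return len(collection_map.get(source, []))
```

An unknown source yields an empty list in this sketch; how the actual system treats an unmapped source is not stated.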

Figure 3 illustrates a preferred record selection method. After one or more terms
have been specified by the user, this method involves selection of a set of records for those
terms.

The method preferably has two embodiments, one for boolean queries and one for
statistical queries. The boolean version is tuned for very high precision and a very small
number of input terms, while the statistical query version is designed for a larger number
of input terms.

The first phase of the retrieval involves accessing the index for the provided terms
from the statistical thesaurus index. For boolean queries, the terms are "AND"ed together,
while the statistical query terms are "OR"ed together. In this phase (which may be called a
"resolve" phase), the specific locations within the records for the query terms are read from
the index, and merged as necessary ("OR"ed or "AND"ed multiple terms). The output
from the merge operation is a list of records, and term locations within the record.
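
The merge step of the resolve phase can be sketched as follows, under the assumption that the index maps each term to a list of (record, group) postings. The name resolve and the data layout are illustrative assumptions, not the disclosed implementation.

```python
def resolve(index, query_terms, boolean=False):
    """Merge postings for the query terms.

    `index` maps term -> [(record_id, group)] (an assumed layout).
    Returns {record_id: {term: [groups]}}. With boolean=True the terms
    are AND-ed (a record must contain every term); otherwise OR-ed.
    """
    hits = {}
    for term in query_terms:
        for record_id, group in index.get(term, []):
            hits.setdefault(record_id, {}).setdefault(term, []).append(group)
    if boolean:
        required = len(set(query_terms))
        hits = {r: locs for r, locs in hits.items() if len(locs) == required}
    return hits

# Hypothetical postings for two query terms.
demo_index = {"murder": [(1, 2), (2, 1)], "malice": [(1, 2)]}
and_hits = resolve(demo_index, ["murder", "malice"], boolean=True)
or_hits = resolve(demo_index, ["murder", "malice"])
```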

The record is scored by tallying the "score" of the highest scoring location for each 
query term within the record, recognizing it is possible for a query term to appear multiple 
times within a record. Group 1 terms score 14 points, group 2 terms score 13 points, and 
so forth through group 5 terms which score 10 points. The maximum score for a record is 
14 times the number of query terms. The minimum score is 10 times the number of query 
terms for boolean queries, and 10 for statistical queries. 
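
The scoring rule above can be expressed compactly. The following is a sketch only: the names GROUP_POINTS and score_record, and the representation of a record as a mapping from term to the groups in which it occurs, are assumptions made for this example.

```python
# Points per group as stated above: group 1 = 14 ... group 5 = 10.
GROUP_POINTS = {1: 14, 2: 13, 3: 12, 4: 11, 5: 10}

def score_record(locations, query_terms, boolean=False):
    """Score one record.

    `locations` maps a query term to the list of groups in which it
    occurs in this record (assumed layout). For each query term only the
    highest-scoring location counts. Returns None when the record does
    not qualify (no term present, or a term missing in boolean mode).
    """
    total = 0
    matched = 0
    for term in query_terms:
        groups = locations.get(term)
        if not groups:
            if boolean:
                return None  # AND semantics: every term is required
            continue
        total += max(GROUP_POINTS[g] for g in groups)
        matched += 1
    return total if matched else None
```

For two query terms, a record holding both in group 1 reaches the maximum of 28, while a statistical query matching a single group 5 term yields the minimum of 10.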

After each record is scored, it is potentially inserted into a "Top 100" List, an
illustrative example of which is shown in FIG. 10B. This list contains the highest-scoring
100 records encountered so far. The list is ordered by score, with the last entry being the
highest-scoring entry. The current record is added to the list at the appropriate place, or
discarded if it doesn't score high enough to make the list. The record at this point is a
record number and a score.
When all records have been processed, the resolve phase is completed.
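
One way to maintain such a bounded, score-ordered list is sketched below. The function name insert_top_n, the tuple layout (score, collection, document), and the handling of ties are assumptions of this example; the specification states only that the list is ordered by score with the highest-scoring entry last.

```python
import bisect

def insert_top_n(top_list, score, collection, document, limit=100):
    """Insert a scored record into an ascending list of at most `limit` entries.

    Entries are (score, collection, document) tuples, so the last entry is
    the highest scoring. Returns False when the record is discarded.
    Assumption: a record that merely ties the current lowest score is
    discarded once the list is full.
    """
    if len(top_list) >= limit and score <= top_list[0][0]:
        return False
    bisect.insort(top_list, (score, collection, document))
    if len(top_list) > limit:
        top_list.pop(0)  # drop the lowest-scoring entry
    return True

# Small demonstration with a limit of 3 instead of 100.
demo_list = []
demo_results = [insert_top_n(demo_list, s, 1, doc, limit=3)
                for doc, s in enumerate([10, 14, 12, 13, 11])]
```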

The List is then scanned to find an ideal cutoff. Beginning with the last entry (the
highest scoring entry), the list is processed in reverse. After 25 entries, if the score
changes by more than 10 points, the list is cut between these two records. After 50 entries,
the list is cut between any change in score. This cutoff routine tends to prevent
contamination of good entries by substantially worse entries.
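
The cutoff routine can be sketched as follows. This example reads "after 25 entries" and "after 50 entries" as meaning that at least that many entries have already been kept when a score change is encountered; that reading, and the name cut_list, are assumptions rather than statements of the disclosed method.

```python
def cut_list(scores_desc, large_gap=10, min_large=25, min_any=50):
    """Return the leading part of a descending score list that is kept.

    Walks from the highest score downward. Once `min_large` entries are
    kept, a drop of more than `large_gap` points cuts the list; once
    `min_any` entries are kept, any drop at all cuts the list.
    """
    for kept in range(1, len(scores_desc)):
        drop = scores_desc[kept - 1] - scores_desc[kept]
        if kept >= min_any and drop > 0:
            return scores_desc[:kept]
        if kept >= min_large and drop > large_gap:
            return scores_desc[:kept]
    return scores_desc
```

With these defaults, a list of 30 records scoring 28 followed by 40 records scoring 14 is cut to the first 30, whereas the same large drop occurring after only 10 records does not trigger a cut.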

As illustrated in Figure 4, after the list is cut, the term extraction portion of the
retrieval takes place. For each record in the list, the compressed text file is accessed to
read the complete text of the record. The terms in the record are then extracted and written
to a work file.

An illustrative example of a work file is shown in FIG. 11A. It is understood that a
work file is produced for each parallel thread. After all parallel threads have terminated,
the results of the various work files are merged. The parallel thread implementation is
described in greater detail with reference to FIG. 5.

If a term is a complete match with a query term, it is not written as it is already
known to the user. For example, if "Clinton" and "Whitewater" are already the query
terms, the phrase "Bill Clinton" would be written, but the single term "Whitewater" or the
single term "Clinton" would not be written.
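
The filtering rule in this example can be sketched in a few lines. The name extract_terms is hypothetical, and comparing terms without regard to letter case is an assumption of this sketch, since the specification does not say how case is treated.

```python
def extract_terms(record_terms, query_terms):
    """Return the record terms to write to the work file.

    Terms that completely match a query term are skipped; phrases that
    merely contain a query term are kept. Case-insensitive matching is
    an assumption of this sketch.
    """
    known = {q.lower() for q in query_terms}
    return [t for t in record_terms if t.lower() not in known]

written = extract_terms(
    ["Bill Clinton", "Whitewater", "Clinton", "Arkansas"],
    ["Clinton", "Whitewater"],
)
```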

When all the records have been read and the terms have been written, the term 
extraction phase is completed, and the sort phase begins. 

The sort phase sorts the terms in the work file, and outputs the terms in alphabetical
order. Multiple occurrences of the same term are now output consecutively. As they are
output, the frequency of each term is calculated. A table of the top 26 terms, ranked by
frequency, is maintained. Figure 11B illustrates an exemplary "Top 26" Terms Table
showing terms ordered by frequency of occurrence. In the preferred embodiment, a term
must have a minimum frequency of 2 to be inserted into the table. After all sorted terms
have been output, the Term Table is output to the end user.
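
The sort phase lends itself to a short sketch: sort the work file, count each run of identical terms, and keep the most frequent. The name top_terms and the alphabetical tie-break between terms of equal frequency are assumptions introduced for this example.

```python
from itertools import groupby

def top_terms(work_file_terms, table_size=26, min_frequency=2):
    """Return up to `table_size` (term, frequency) pairs, most frequent first."""
    counted = []
    # After sorting, repeated terms are adjacent, so each run gives a frequency.
    for term, run in groupby(sorted(work_file_terms)):
        frequency = sum(1 for _ in run)
        if frequency >= min_frequency:
            counted.append((term, frequency))
    # Assumption: ties in frequency are broken alphabetically.
    counted.sort(key=lambda pair: (-pair[1], pair[0]))
    return counted[:table_size]

table = top_terms(
    ["malice", "murder trial", "malice", "habeas", "murder trial", "malice"]
)
```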

The above method dynamically builds the list of related terms for any combination
of query terms. Additionally, the collection of records may be supplemented at any time
with only a small update to the index file.

Figure 5 illustrates various parallel phases in a preferred query expansion method
according to the present invention.

In order to achieve maximum performance, several phases in the above process are
performed in parallel. A parallel thread is created for each physical collection being
searched. The statistical thesaurus is preferably built in several small collections instead of
a single large collection. Preferably, these collections are based on the source of text, such
as case law or news wires.

As shown in FIG. 5, after the number of collections has been determined, a unique
collection thread is spawned for each collection. These threads execute concurrently; that
is, while one thread waits on a physical disk I/O, another thread may be processing. This
parallelism allows the inventive process to be easily divided across processors.

Also as described with reference to FIG. 4, the reading of compressed records, and
extraction and writing of terms and phrases to a work file, are also performed in
parallel, each of the parallel paths being distinguished by the source text collection.
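
The one-thread-per-collection arrangement can be illustrated with Python's standard thread pool. This is a sketch under stated assumptions: expand_in_parallel and the search_collection callback are hypothetical names, and the per-collection work (record selection and term extraction) is represented by whatever function the caller supplies.

```python
from concurrent.futures import ThreadPoolExecutor

def expand_in_parallel(collections, query_terms, search_collection):
    """Run one worker thread per physical collection and merge the work files.

    `search_collection(collection, query_terms)` is a caller-supplied
    function (an assumption of this sketch) that returns the list of
    terms written to that collection's work file.
    """
    if not collections:
        return []
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        work_files = list(
            pool.map(lambda c: search_collection(c, query_terms), collections)
        )
    # Merge after every thread has finished, as described above.
    merged = []
    for work_file in work_files:
        merged.extend(work_file)
    return merged

# Hypothetical stand-in for the per-collection search.
FAKE_WORK_FILES = {1: ["malice", "habeas"], 2: ["malice"], 3: []}
merged_terms = expand_in_parallel(
    [1, 2, 3], ["murder"], lambda c, q: FAKE_WORK_FILES[c]
)
```

Threads suit this workload in the sense described above: while one thread waits on disk I/O, another can proceed.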

Thus, the invention provides a statistical thesaurus built from multiple information
sources. Significantly, the statistical thesaurus is managed so that many different
combinations of records may be searched to correspond to the collection specified by the
end user.

As a particular example of the advantages of forming query expansion based on
source text collection, a user of the LEXIS-NEXIS™ system may select the library
GENFED (which contains federal case law), the library NEWS (which contains news
media documents), or the library PATENT (which contains the full text of U.S. patents).
These are examples of the source text collections mentioned above. If the user is searching
in GENFED and the topic is MURDER, the related concepts provide better search
performance if they are derived from federal case law. Conversely, the news media search
would work better if the term is expanded using records generated from news documents.
The difference in terms is clearly illustrated in Figures 7 and 9: Figure 7 shows related
concepts for NEWS searches, while Figure 9 shows related concepts for GENFED
searches.

This process is managed by sampling document collections individually, and then
maintaining the generated term records in separate collections. Then, significantly, the
term record collections are combined dynamically based upon the document collection
being searched by the end user.

A hardware environment in which the inventive thesaurus may be developed, stored
and used is shown in FIG. 14. In particular, a document search and retrieval system 30 is
shown. The system allows a user to search a subset of a plurality of documents for
particular key words or phrases. The system then allows the user to view documents that
match the search request. The system 30 comprises a plurality of Search and Retrieval
(SR) computers 32-35 connected via a high speed interconnection 38 to a plurality of
Session Administrator (SA) computers 42-44.

Each of the SR's 32-35 is connected to one or more document collections 46-49,
each containing text for a plurality of documents, indexes therefor, and other ancillary
data. More than one SR can access a single document collection. Also, a single SR can be
provided access to more than one document collection. The SR's 32-35 can be
implemented using a variety of commercially available computers well known in the art,
such as Model EX 100 manufactured by Hitachi Data Systems of Santa Clara, California.

Each of the SA's 42-44 is provided access to data representing phrase and thesaurus
dictionaries 52-54. The SA's 42-44 can also be implemented using a variety of
commercially available computers, such as Models 5990 and 5995 manufactured by
Amdahl Corporation of Sunnyvale, California. The interconnection 38 between the SR's
and the SA's can be any one of a number of two-way high-speed computer data
interconnections well known in the art, such as the Model 7200-DX manufactured by
Network Systems Corporation of Minneapolis, Minnesota.

Each of the SA's 42-44 is connected to one of a plurality of front end processors
56-58. The front end processors 56-58 provide a connection of the system 30 to one or more
commonly available networks 62 for accessing digital data, such as an X.25 network, long
distance telephone lines, and/or SprintNet. Connected to the network 62 are plural
terminals 64-66 which provide users access to the system 30. Terminals 64-66 can be
dumb terminals which simply process and display data inputs and outputs, or they can be
one of a variety of readily available stand-alone computers, such as IBM or
IBM-compatible personal computers. The front end processors 56-58 can be implemented by a
variety of commercially available devices, such as Models 4745 and 4705 manufactured by
the Amdahl Corporation of Sunnyvale, California.

The number of components shown in FIG. 14 is for illustrative purposes only.
The system 30 described herein can have any number of SA's, SR's, front end processors,
etc. Also, the distribution of processing described herein may be modified and may in fact
be performed on a single computer without departing from the spirit and scope of the
invention.

A user wishing to access the system 30 via one of the terminals 64-66 will use the
network 62 to establish a connection, by means well known in the art, to one of the front
end processors 56-58. The front end processors 56-58 handle communication with the user
terminals 64-66 by providing output data for display by the terminals 64-66 and by
processing terminal keyboard inputs entered by the user. The data output by the front end
processors 56-58 includes text and screen commands. The front end processors 56-58
support screen control commands, such as the commonly known VT100 commands, which
provide screen functionality to the terminals 64-66 such as clearing the screen and moving
the cursor insertion point. The front end processors 56-58 can handle other known types of
terminals and/or stand-alone computers by providing appropriate commands.

Each of the front end processors 56-58 communicates bidirectionally, by means well
known in the art, with its corresponding one of the SA's 42-44. It is also possible to
configure the system, in a manner well known in the art, such that one or more of the front
end processors can communicate with more than one of the SA's 42-44. The front end
processors 56-58 can be configured to "load balance" the SA's 42-44 in response to data
flow patterns. The concept of load balancing is well known in the art.

Each of the SA's 42-44 contains an application program that processes search
requests input by a user at one of the terminals 64-66 and passes the search request
information onto one or more of the SR's 32-35, which perform the search and return the
results, including the text of the documents, to the SA's 42-44. The SA's 42-44 provide the
user with text documents corresponding to the search results via the terminals 64-66. For a
particular user session (i.e. a single user accessing the system via one of the terminals
64-66), a single one of the SA's 42-44 will interact with a user through an appropriate one of
the front end processors 56-58.

The collection selection method (FIG. 13) may be executed in either the session
administrator (SA) computers 42-44 or in the search and retrieval computers 32-35. The
remainder of the methods described above (FIGS. 3, 4) are preferably executed on the
search and retrieval computers 32-35.

Of course, the inventions related to the formation, storage, and application of the
statistical thesaurus may be implemented on any of a variety of computer platforms, and
should not be limited to the example mentioned above.

Modifications and variations of the above-described embodiments of the present
invention are possible, as appreciated by those skilled in the art in light of the above
teachings. It is therefore to be understood that, within the scope of the appended claims
and their equivalents, the invention may be practiced otherwise than as specifically
described.

1. A system including a dynamic statistical thesaurus for use in interactively
generating query expansion terms in an automated text document search and retrieval
system, the system comprising:

means for receiving at least one search query term;
a plurality of collections of records having term groups addressable by an
indexing scheme, wherein the collections are distinguished from each other based on
different text sample sources and the records are formed into groups with different weights
constituting part of the indexing scheme; and

means for using the indexing scheme to allow a user to interactively search
the plurality of collections to generate the query expansion terms to supplement the at least
one search query term.

2. The system of claim 1, wherein:

the query term includes plural words constituting a phrase.

3. The system of claim 1, wherein:

the plurality of collections searched are a subset of a larger set of collections
which is based on a text sample source.

4. A ranking method for ranking term relationship records to allow a set of 
related concepts to be derived for query expansion, the ranking method comprising: 

assigning term weights for respective query terms; 
summing the weights to determine a score; 

assigning a cutoff score based on gaps between successive records; and 
accepting only records exceeding the cutoff score; 
wherein the assigning step includes assigning a cutoff score so that fewer
records are accepted when a large gap in score occurs.

5. A statistical thesaurus, comprising:

a plurality of collections of indexed records;

wherein the records constitute respective documents, each document
including plural terms which are grouped by weight in different indexed sections of the
document to allow the records to be searched by a conventional text document search and
retrieval system to add to the records and/or to form a list of related concepts for possible
inclusion in expansion query terms.

6. A query expansion method in a text document search and retrieval system
using a statistical thesaurus to generate expansion query terms, the method comprising:

dividing the statistical thesaurus into small physical collections, each 
physical collection having its own index; 

searching the multiple collections in parallel, using the respective indexes;
and

merging the search results to form a list of related concepts to be included in
the expansion query terms.

7. The method of claim 6, wherein the dividing step includes:

dividing the statistical thesaurus into small physical collections distinguished
based on mutually different text sample sources.